-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding logic to master_is_stable indicator to check for discovery problems #88020
Adding logic to master_is_stable indicator to check for discovery problems #88020
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @masseyke, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
I have some questions about the approach we took here.
Also, would you mind trimming the description to contain only the relevant parts ? (ie. master_is_stable indicator, without impacts and the likes - it seems that only summary and details are affected)
} else if (clusterService.localNode().isMasterNode() == false) { // none is elected master and we aren't master eligible | ||
// NOTE: The logic in this block will be implemented in a future PR | ||
result = new CoordinationDiagnosticsResult( | ||
CoordinationDiagnosticsStatus.RED, | ||
"No master has been observed recently", | ||
CoordinationDiagnosticsDetails.EMPTY | ||
); | ||
} else { // none is elected master and we are master eligible | ||
result = diagnoseOnHaveNotSeenMasterRecentlyAndWeAreMasterEligible(localMasterHistory, masterEligibleNodes, explain); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this if/else block is becoming hard to follow and reason about
ie. where do we check we aren't master eligibile node? the last else
statement has a bunch of implicit decisions that are hard to verify (ie. why are we sure we're master eligible in this case?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The else if
right above the else checks if we're master eligible. I'll try to make it more explicit.
server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java
Outdated
Show resolved
Hide resolved
* @param nodeToClusterFormationStateMap A map of each master node to its ClusterFormationState | ||
* @return true if there are discovery problems, false otherwise | ||
*/ | ||
private boolean hasDiscoveryProblems( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit ambiguous w.r.t. what it is diagnosing - who has discovery problems?
Maybe we can be more intentional in the method name and also return (or log?) the problems we discover?
ie. who cannot discover which node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was originally calculating and returning this, but removed that since we're not going to be putting it in the details in the response. I can change it to log that information for now.
* @param nodeToClusterFormationStateMap A map of each master node to its ClusterFormationState | ||
* @return True if any nodes in nodeToClusterFormationStateMap report a problem forming a quorum, false otherwise. | ||
*/ | ||
private boolean hasQuorumProblems( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - would it be useful to log the problems we discover? Or return them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah at some point we'll be putting them into the details section of the response. For now I'll log them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm - this means they don't currently add any new information compared to what we provide in the generic cluster_formation.description
field. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. That was the conclusion to the discussion on the document I shared about this:
W.r.t. structure I don’t think we have a clear indication as to how the details section of the `master_is_stable` indicator is going to be used for now, so I’d
suggest we keep the `ClusterFormationState#description` as the only field in the `master_is_stable` details field for now and add structure at
a later phase.
Once the health API is used in the diagnostics bundle (very soon) we’ll be able to get some more engineers exposed to the `details` field and get some
feedback about the needs and shortcomings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ I was just curious if we've gained extra information in these diagnostic steps
Thanks for the confirmation
Would we benefit from attaching #87482 (comment) to the meta issue ? |
Added at #85624 (comment) |
…b.com:masseyke/elasticsearch into feature/health-api-master-stability-discovery
@elasticmachine update branch |
…b.com:masseyke/elasticsearch into feature/health-api-master-stability-discovery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this Keith
Left one suggestion.
+ "eligible nodes", | ||
nodeHasMasterLookupTimeframe | ||
), | ||
getDetails(explain, localMasterHistory, null, coordinator.getClusterFormationState().getDescription()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen @DaveCTurner use the cluster formation details from all involved nodes.
I think, since we got hang of all the master nodes view on the cluster formation, we should report each node's view under the details
section.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we definitely need to do that. It sounds like we had a miscommunication earlier -- I thought you wanted that information removed so I removed it. I'll find a way to put it back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK if there is a discovery or quorum problem, I am now putting all of the cluster formation descriptions for all master nodes into the details for debugging purposes.
assertThat(result.summary(), containsString(" some master eligible nodes are unable to discover other master eligible nodes")); | ||
} | ||
|
||
public void testAnyNodeInClusterReportsDiscoveryProblems() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
@elasticmachine run elasticsearch-ci/packaging-tests-windows-sample |
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this Keith
Just a couple more questions left
server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java
Show resolved
Hide resolved
@@ -361,7 +511,8 @@ private CoordinationDiagnosticsResult getResultOnNoMasterEligibleNodes(MasterHis | |||
CoordinationDiagnosticsDetails details = getDetails( | |||
explain, | |||
localMasterHistory, | |||
coordinator.getClusterFormationState().getDescription() | |||
null, | |||
Map.of(coordinator.getLocalNode().getId(), coordinator.getClusterFormationState().getDescription()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be beneficial if the local node's view is expressed separately? Or put it the other way, the remote ones be expressed oner a new object in the representation? remote
or something better named?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would we still want to have the cluster formation expressed as objects for future extension? (in case we'll want to add some structure)
ie:
"localNode": { "description" : " issues" },
"remote" : {
"node1" : { "description" : "issues" },
"node2": { "description" : "other things"}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't really think of any reason, so I put them together to simplify the response. Can you think of a reason it would be useful to have them separate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if in large clusters it'll become a bit difficult to determine the local view amongst a list of remote views? Maybe I'm over-optimising.
Could you please update the PR description with the latest responses and let's move ahead with the simple solution for now (we could iterate afterwards to improve it - once we see it in live cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for implementing this Keith
@elasticmachine run elasticsearch-ci/part-1 |
This PR builds on #86524, #87482, and #87306 by supporting the case where there has been no master node in the last 30 second, no node has been elected master, and the current node is master eligible. This is branch 1.2.2.4 in the diagram at #87482 (comment).
The outline of the logic is that when we see that the master node has gone null, we start polling other master-eligible nodes for their ClusterFormationState. Once a diagnoseMasterStability() request comes in we look at the ClusterFormationStates from all of the mater nodes and the result is one of the following:
Note that in this PR we are not returning all of the details described in the diagram (such as which nodes cannot discover which other nodes). Instead we're only giving the details from the local ClusterFormationState. Once we figure out how the details will be used we will add the other information in a later PR.
Here is an example response for case 1 above:
And an example from case 2:
And case 3:
And case 4: